## [1] 1599
## 'data.frame': 1599 obs. of 14 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ quality.factor : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality quality.factor
## Min. : 8.40 Min. :3.000 3: 10
## 1st Qu.: 9.50 1st Qu.:5.000 4: 53
## Median :10.20 Median :6.000 5:681
## Mean :10.42 Mean :5.636 6:638
## 3rd Qu.:11.10 3rd Qu.:6.000 7:199
## Max. :14.90 Max. :8.000 8: 18
I’ll start by plotting a histogram of the quality variable to check how it’s distributed.
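A minimal sketch of that first plot, using base R graphics rather than whatever plotting calls produced the original figure. To stay self-contained it uses only the first ten quality values shown in the str() output above; on the full data frame the same calls would apply to the whole quality column.

```r
# First ten quality scores, copied from the str() output above
# (a tiny sample used only to keep this sketch self-contained).
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# quality is a discrete score, so count each value instead of binning it.
# Fixing the levels to 3..8 keeps scores with zero count on the axis.
quality_counts <- table(factor(quality, levels = 3:8))
print(quality_counts)

# Bar chart of the counts; on the full data this plays the role of the
# quality histogram.
barplot(quality_counts, xlab = "quality", ylab = "count")
```

Counting the values first also makes it easy to compare the bar heights with the quality.factor column of the summary output.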
With the quality histogram in hand, I’ll plot histograms of the other variables in the dataset to check how each one is distributed. Maybe some have distributions that look like the one above? Let’s check.
fixed.acidity: Normal distribution. 20 outliers have been discarded.
volatile.acidity: Depending on the binwidth, this one can pass for a normal distribution, but the histogram is clearly bimodal. 21 outliers have been discarded.
citric.acid: Lots of values are equal to zero with another peak at around 0.5. Looks like a plateau distribution with some peaks at round numbers. The plot using scale_x_log10 is a left skewed distribution.
residual.sugar: Right skewed distribution. 21 outliers > 8 have been discarded. The log_10 plot shows a distribution that is very close to the first one.
chlorides: Normal distribution with some outliers to the right. 41 outliers have been discarded in this plot.
free.sulfur.dioxide: Right skewed distribution. 4 outliers > 50 have been discarded. The log_10 plot also shows a right skewed distribution.
total.sulfur.dioxide: Right skewed distribution. 2 outliers > 200 have been discarded. In this case the log_10 plot shows a bimodal histogram.
density: Normal distribution.
pH: Normal distribution. 7 outliers have been discarded.
sulphates: Right skewed distribution. 8 outliers > 1.5 have been discarded.
alcohol: Right skewed distribution. 2 outliers have been discarded.
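The trimming and log scaling mentioned above can be sketched roughly as follows. This is an assumption about the mechanics, not the code behind the plots: it drops values above a cutoff before drawing the histogram, and it applies log10() to the values, which is roughly what scale_x_log10 does to the axis in ggplot2. Only the first ten residual.sugar values from the str() output are used, so nothing exceeds the cutoff of 8 in this sample.

```r
# First ten residual.sugar values, copied from the str() output above.
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Cutoff quoted in the text for residual.sugar ("outliers > 8").
cutoff <- 8

# Flag and count the values that would be left out of the plot.
is_outlier <- residual.sugar > cutoff
n_discarded <- sum(is_outlier)
kept <- residual.sugar[!is_outlier]
cat("discarded:", n_discarded, "\n")

# Histogram on the original scale, without the discarded values.
hist(kept, breaks = 10, main = "residual.sugar",
     xlab = "residual.sugar (g / dm^3)")

# Histogram on a log10 scale; a long right tail gets compressed.
hist(log10(residual.sugar), breaks = 10, main = "log10(residual.sugar)",
     xlab = "log10 of residual.sugar")
```

Note that a variable with zeros, such as citric.acid, needs care on a log scale because log10(0) is -Inf.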
Looking only at the Univariate Plots above, we cannot say which variables have the most influence on quality. We’ll see more about that in the Bivariate Plots Section.
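This dataset has 1599 observations of red Portuguese “Vinho Verde” wines. The original data contains the 12 variables below; the other two columns in the output above are the row index X and the quality.factor variable I created: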
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
The quality variable is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). So quality is a qualitative (ordered categorical) variable. The other variables are the results of objective tests (e.g. pH values).
The main feature in this dataset is the quality of the wine, and I’m particularly interested in finding which variable(s) influence the quality of those wines the most.
The answer to that will come in the Bivariate Plots Section. Anything I say based only on the Univariate Plots would be mere speculation.
Yes, I created one new variable, quality.factor, which is the quality variable cast to a factor. That helps when I want to draw box plots with quality on the x axis.
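For reference, the conversion can be done with factor(). The sketch below uses the first ten quality values from the str() output and sets the levels explicitly so the result has the same six levels ("3" to "8") as the real column. On the full dataset all six scores occur, so factor(quality) alone should give the same levels; the explicit levels argument is only needed for this ten-value sample.

```r
# First ten quality scores, copied from the str() output above.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Cast the integer score to a factor with one level per observed score.
quality.factor <- factor(quality, levels = 3:8)

# Inspect the result; the integer codes are positions within the levels,
# so a score of 5 is stored as code 3.
str(quality.factor)
```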
I didn’t see any unusual distributions, but I did see some variables whose histograms have very high peaks, such as residual.sugar and chlorides. Other variables, such as fixed.acidity, have distributions that look like the quality histogram.
The objective here is to see how each of the variables in the dataset relates to quality. I’ll begin by plotting a correlation matrix using the corrplot library to check which of the variables are most likely to be related to quality.
If you look at how quality correlates with the other variables, you will notice that alcohol and volatile.acidity have the highest correlations, the former positive and the latter negative.
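As a rough sketch of how such a matrix can be computed, the block below calls cor() on a small data frame built from the first ten rows shown in the str() output (only four columns, to keep it short). With ten rows the numbers are not meaningful; the point is the mechanics. The corrplot call is guarded so the sketch still runs when that package is not installed.

```r
# First ten rows of four columns, copied from the str() output above.
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation between every pair of columns (the default method).
cor_matrix <- cor(wine)

# Pull out the row for quality and sort it from most negative to most
# positive, which is the part of the matrix read off in the text.
predictors <- setdiff(colnames(cor_matrix), "quality")
quality_cor <- cor_matrix["quality", predictors]
print(round(sort(quality_cor), 2))

# Draw the matrix with corrplot if the package is available.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_matrix, method = "number")
}
```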
I’ll go a bit further into this and will not restrict the analysis to alcohol and volatile.acidity. Below I’m going to draw scatter plots with regression lines fitted by a linear model, and to complement them I’ll draw box plots using quality.factor.
quality vs fixed.acidity: These plots show no correlation. The linear regression line is almost horizontal, and so is the trend in the box plot medians.
quality vs volatile.acidity: The two plots above confirm what we see in the correlation matrix: quality and volatile.acidity have a clear negative correlation. The linear regression line has a distinctly negative slope, and the box plots agree, with the medians decreasing as quality increases.
quality vs citric.acid: This relationship does not stand out in the correlation matrix, but here we can see that quality and citric.acid do correlate with each other. The linear regression line has a clearly positive slope, and the box plots agree, with the medians increasing as quality increases.
quality vs residual.sugar: The flat, horizontal regression line indicates no correlation here, and the box plot medians confirm that.
quality vs chlorides: There is some correlation here but it’s too weak to be taken into consideration. Also the number of outliers with quality 5 and 6 is very high.
quality vs free.sulfur.dioxide: There is no correlation here. The linear regression line is almost horizontal, and so is the trend in the box plot medians.
quality vs total.sulfur.dioxide: No correlation. The linear regression line has a slightly negative slope, which the box plot does not confirm.
quality vs pH: Correlation is too weak. The linear regression line is almost flat and box plot medians have a trend but it’s not strong enough.
quality vs density: There is some correlation but it is too weak. The linear regression line indicates a direction, but the box plot medians don’t confirm it from quality=4 to quality=5, where they move in the opposite direction. The small number of data points with quality=3 and 4 means that this reversal carries little weight.
quality vs sulphates: I can see some level of correlation here based on the linear regression line, which is not contradicted by the box plot medians. The number of outliers does stand out, though, which makes the regression less reliable.
quality vs alcohol: The strongest correlation is here. The linear regression line has a clear trend that confirms the correlation matrix. From quality=3 to quality=4 the box plot medians go against the direction of the correlation, but there are too few data points at those two quality levels for that to be a reason to discard the correlation.
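One pair of plots like those described above could be produced along these lines. This is a base R sketch, not the original plotting code, and it assumes both plots put quality on the x axis so that the regression line and the box plot medians can be compared directly. It uses the first ten alcohol and quality values from the str() output, so the fitted slope itself means very little.

```r
# First ten alcohol and quality values, copied from the str() output above.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Linear model of the variable against quality (quality on the x axis).
fit <- lm(alcohol ~ quality)
slope <- coef(fit)[["quality"]]
cat("slope of the regression line:", round(slope, 3), "\n")

# Scatter plot with the fitted regression line on top.
plot(quality, alcohol, xlab = "quality", ylab = "alcohol (% by volume)")
abline(fit)

# Box plot of the same variable grouped by quality score.
boxplot(alcohol ~ factor(quality), xlab = "quality",
        ylab = "alcohol (% by volume)")

# The center line of each box is the group median, computed here directly.
medians <- tapply(alcohol, quality, median)
print(medians)
```

Printing the medians next to the slope gives a quick numeric check on whether the two plots tell the same story.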
Analyzing the plots above, I noticed that quality does not correlate with most of the other variables: fixed.acidity, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide and pH.
The plots of volatile.acidity, citric.acid, density, sulphates and alcohol show that they do correlate with quality with alcohol being the variable that correlates with quality the most followed by volatile.acidity, just like the correlation matrix had indicated.
quality vs alcohol vs density: I’ve chosen alcohol and density here because they are highly correlated with each other (from the correlation matrix). The plot shows the trend that high quality wines tend to have high alcohol and low density, but the points are widely spread and there is a considerable number of outliers.
quality vs citric.acid vs volatile.acidity: Again the plot shows a trend, but it’s inconclusive due to the number of outliers. High quality wines tend to have low volatile acidity and high citric acid.
quality vs alcohol vs volatile.acidity: The two variables that have the strongest correlation with quality are alcohol and volatile.acidity, so this plot makes a lot of sense. The trend is clear and the number of outliers is smaller than in the two other Multivariate Plots above. Clearly wines of high quality tend to have high alcohol and low volatile acidity.
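A three-variable plot of this kind can be sketched by putting two variables on the axes and encoding quality as color. The choice of axes and the blue-to-red palette below are assumptions for illustration, and base R graphics stand in for the original plotting code. The data are again the first ten rows from the str() output.

```r
# First ten rows of the three variables, copied from the str() output above.
alcohol          <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
volatile.acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
quality          <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# One color per quality level, from blue (low) to red (high).
quality.factor <- factor(quality, levels = 3:8)
level_colors <- colorRampPalette(c("blue", "red"))(nlevels(quality.factor))

# Look up each point's color through its factor code.
point_colors <- level_colors[as.integer(quality.factor)]

# Scatter plot of the two variables, colored by quality.
plot(alcohol, volatile.acidity, col = point_colors, pch = 19,
     xlab = "alcohol (% by volume)",
     ylab = "volatile acidity (g / dm^3)")
legend("topright", legend = levels(quality.factor), col = level_colors,
       pch = 19, title = "quality")
```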
I decided to use scatter plots to examine the relationship between quality and the variables most strongly related to it, in particular volatile.acidity and alcohol.
The plots confirm the correlation numbers; some are denser, others more spread out. Some have more outliers than others, of course, but they confirm the correlations listed in the Bivariate Analysis Section and graphically show the variables’ relationships with each other and with quality.
The correlation between quality and alcohol is very clear here, and the linear regression line shows it graphically.
The negative correlation between quality and volatile.acidity is shown in this box plot, with the medians decreasing as quality increases.
The citric.acid plot is the surprise. Its correlation with quality didn’t stand out in the correlation matrix, but the box plot makes it clear that the correlation exists and is strong enough to be taken into consideration.
I’ve found that those 3 variables (alcohol, volatile.acidity and citric.acid) have the strongest correlations with the wine quality score. To take this analysis a step further, I would try to create a regression model that uses those 3 variables to estimate a wine’s quality.
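A first attempt at such a model might look like the sketch below, which fits an ordinary linear regression with lm(). It is only an illustration of the idea: it is fitted on the first ten rows from the str() output, so the coefficients mean nothing, and it treats the ordinal quality score as a plain number, which is a simplification. The new_wine values are made up for the example.

```r
# First ten rows of the relevant columns, copied from the str() output above.
wine <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Linear model of quality on the three variables named in the text.
model <- lm(quality ~ alcohol + volatile.acidity + citric.acid, data = wine)
print(coef(model))

# Hypothetical wine, with values invented for this example only.
new_wine <- data.frame(alcohol = 10, volatile.acidity = 0.5, citric.acid = 0.3)

# Estimated quality for that wine under the fitted model.
predicted <- predict(model, newdata = new_wine)
print(predicted)
```

On the full dataset, summary(model) would be the natural next step to see how much of the variation in quality the three variables explain.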
This project has been of great value, as it challenged me to learn more about correlations, plots, R libraries, ggplot features, histogram distributions and more. With this hands-on experience I feel more confident about exploring other data sets.